Welcome to part A of the Frequentist inference case study! The purpose of this case study is to help you apply the concepts associated with Frequentist inference in Python. Frequentist inference is the process of deriving conclusions about an underlying distribution via the observation of data. In particular, you'll practice writing Python code to apply the following statistical concepts:
- the z-statistic
- the t-statistic
- the difference and relationship between the two
- the Central Limit Theorem, including its assumptions and consequences
- how to estimate the population mean and standard deviation from a sample
- the concept of a sampling distribution of a test statistic, particularly for the mean
- how to combine these concepts to calculate a confidence interval
To be able to complete this notebook, you are expected to have a basic understanding of:
- what a random variable is
- what a population, and a population distribution, are
- a high-level sense of what the normal distribution is
- what the t-statistic is
Happily, these should all be concepts with which you are reasonably familiar after having read ten chapters of Professor Spiegelhalter's book, The Art of Statistics.
We'll try to relate the concepts in this case study back to page numbers in The Art of Statistics so that you can focus on the Python aspects of this case study. The second part (part B) of this case study will involve another, more real-world application of these tools.
For this notebook, we will use data sampled from a known normal distribution. This allows us to compare our results with theoretical expectations.
First, let's explore the ways we can generate the normal distribution. While there's a fair amount of interest in sklearn within the machine learning community, you're likely to have heard of scipy if you're coming from the sciences. For this assignment, you'll use scipy.stats to complete your work.
This assignment will require some digging around and getting your hands dirty (your learning is maximized that way)! You should have the research skills and the tenacity to do these tasks independently, but if you struggle, reach out to your immediate community and your mentor for help.
from scipy.stats import norm
from scipy.stats import t
import numpy as np
import pandas as pd
from numpy.random import seed
import matplotlib.pyplot as plt
Q1: Call up the documentation for the norm function imported above. (Hint: that documentation is here.) What is the second listed method?
A:
It's pdf(x, loc=0, scale=1), the probability density function.
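If you want to pull up that documentation from inside the notebook itself, one minimal way to do it (help() is standard Python; the norm? form is IPython/Jupyter-specific) is:
# print norm's docstring, which includes the ordered list of its methods
help(norm)
# in a Jupyter notebook, `norm?` shows the same documentation in a pager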
Q2: Use the method that generates random variates to draw five samples from the standard normal distribution.
A:
seed(47)
# draw five samples here
r = norm.rvs(size=5)
plt.plot(r, marker='.', linestyle='none');
Q3: What is the mean of this sample? Is it exactly equal to the value you expected? Hint: the sample was drawn from the standard normal distribution. If you want a reminder of the properties of this distribution, check out p. 85 of AoS.
A:
# Calculate and print the mean here, hint: use np.mean()
mean = np.mean(r)
mean
Q4: What is the standard deviation of these numbers? Calculate this manually here as $\sqrt{\frac{\sum_i(x_i - \bar{x})^2}{n}}$ (This is just the definition of standard deviation given by Professor Spiegelhalter on p.403 of AoS). Hint: np.sqrt() and np.sum() will be useful here and remember that NumPy supports broadcasting.
A:
np.sqrt(np.sum((r - mean)**2) / len(r))
np.std(r)
# reference: ddof=0 is the default for np.std and ndarray.std,
# so this matches the manual calculation above
r.std(ddof=0)
Here we have calculated the actual standard deviation of a small data set (of size 5). But in this case, this small data set is actually a sample from our larger (infinite) population. In this case, the population is infinite because we could keep drawing our normal random variates until our computers die!
In general, the sample mean we calculate will not be equal to the population mean (as we saw above). A consequence of this is that the sum of squares of the deviations from the population mean will be bigger than the sum of squares of the deviations from the sample mean. In other words, the sum of squares of the deviations from the sample mean is too small to give an unbiased estimate of the population variance. An example of this effect is given here. Scaling our estimate of the variance by the factor $n/(n-1)$ gives an unbiased estimator of the population variance. This factor is known as Bessel's correction. The consequence of this is that the $n$ in the denominator is replaced by $n-1$.
You can see Bessel's correction reflected in Professor Spiegelhalter's definition of variance on p. 405 of AoS.
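To see this bias in action, here is a minimal sketch (the sample size of 5 and the 10,000 repeats are arbitrary choices): draw many small samples from the standard normal, whose true variance is 1, and compare the two flavours of variance estimate.
# draw 10,000 samples of size 5 from the standard normal (true variance = 1)
samples = norm.rvs(size=(10000, 5))
# variance estimates with the n denominator (biased) and n-1 denominator (unbiased)
biased = samples.var(axis=1, ddof=0)
unbiased = samples.var(axis=1, ddof=1)
# the biased estimates average about 0.8 (= (n-1)/n); the unbiased about 1
print(np.mean(biased), np.mean(unbiased))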
Q5: If all we had to go on was our five samples, what would be our best estimate of the population standard deviation? Use Bessel's correction ($n-1$ in the denominator), thus $\sqrt{\frac{\sum_i(x_i - \bar{x})^2}{n-1}}$.
A:
np.sqrt(np.sum((r - mean)**2) / (len(r) - 1))
# reference
r.std(ddof=1)
Q6: Now use numpy's std function to calculate the standard deviation of our random samples. Which of the above standard deviations did it return?
A:
np.std(r)
NumPy's std function returned the former: the value with $n$ in the denominator (the standard deviation of the sample).
Q7: Consult the documentation for np.std() to see how to apply the correction for estimating the population parameter and verify this produces the expected result.
A:
np.std(r), np.std(r,ddof=1)
The two values differ: by default np.std uses the $n$ denominator, while ddof=1 applies Bessel's correction ($n-1$) and returns the unbiased estimate of the population standard deviation, as expected.
In this section, you've been introduced to the scipy.stats package and used it to draw a small sample from the standard normal distribution. You've calculated the mean of that sample and computed its standard deviation in two ways: with $n$ and with $n-1$ in the denominator.
You use $n$ as the denominator if you want to calculate the standard deviation of a sequence of numbers. You use $n-1$ if you are using this sequence of numbers to estimate the population parameter. This brings us to some terminology that can be a little confusing.
The population parameter is traditionally written as $\sigma$ and the sample statistic as $s$. Rather unhelpfully, $s$ is also called the sample standard deviation (using $n-1$) whereas the standard deviation of the sample uses $n$. That's right, we have the sample standard deviation and the standard deviation of the sample and they're not the same thing!
The sample standard deviation \begin{equation} s = \sqrt{\frac{\sum_i(x_i - \bar{x})^2}{n-1}} \approx \sigma, \end{equation} is our best (unbiased) estimate of the population parameter ($\sigma$).
If your dataset is your entire population, you simply want to calculate the population parameter, $\sigma$, via \begin{equation} \sigma = \sqrt{\frac{\sum_i(x_i - \bar{x})^2}{n}} \end{equation} as you have complete, full knowledge of your population. In other words, your sample is your population. It's worth noting that we're dealing with what Professor Spiegelhalter describes on p. 92 of AoS as a metaphorical population: we have all the data, and we act as if each data point is taken from a population at random. We can think of this population as an imaginary space of possibilities.
If, however, you have sampled from your population, you only have partial knowledge of the state of your population. In this case, the standard deviation of your sample is not an unbiased estimate of the standard deviation of the population, in which case you seek to estimate that population parameter via the sample standard deviation, which uses the $n-1$ denominator.
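A compact way to keep the two straight in code (a minimal sketch; the ddof argument is the switch between the two denominators):
x = norm.rvs(size=10)
print(np.std(x))          # standard deviation of the sample (n denominator)
print(np.std(x, ddof=1))  # sample standard deviation, s (n-1 denominator)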
Great work so far! Now let's dive deeper.
So far we've been dealing with the concept of taking a sample from a population to infer the population parameters. One statistic we calculated for a sample was the mean. As our samples will be expected to vary from one draw to another, so will our sample statistics. If we were to perform repeat draws of size $n$ and calculate the mean of each, we would expect to obtain a distribution of values. This is the sampling distribution of the mean. The Central Limit Theorem (CLT) tells us that such a distribution will approach a normal distribution as $n$ increases (the intuitions behind the CLT are covered in full on p. 236 of AoS). For the sampling distribution of the mean, the standard deviation of this distribution is given by
\begin{equation} \sigma_{mean} = \frac{\sigma}{\sqrt n} \end{equation}where $\sigma_{mean}$ is the standard deviation of the sampling distribution of the mean and $\sigma$ is the standard deviation of the population (the population parameter).
This is important because typically we are dealing with samples from populations and all we know about the population is what we see in the sample. From this sample, we want to make inferences about the population. We may do this, for example, by looking at the histogram of the values and by calculating the mean and standard deviation (as estimates of the population parameters), and so we are intrinsically interested in how these quantities vary across samples.
In other words, now that we've taken one sample of size $n$ and made some claims about the general population, what if we were to take another sample of size $n$? Would we get the same result? Would we make the same claims about the general population? This brings us to a fundamental question: when we make some inference about a population based on our sample, how confident can we be that we've got it 'right'?
We need to think about estimates and confidence intervals: those concepts covered in Chapter 7, p. 189, of AoS.
Now, the standard normal distribution (with its variance equal to its standard deviation of one) would not be a great illustration of a key point. Instead, let's imagine we live in a town of 50,000 people and we know the height of everyone in this town. We will have 50,000 numbers that tell us everything about our population. We'll simulate these numbers now and put ourselves in one particular town, called 'town 47', where the population mean height is 172 cm and population standard deviation is 5 cm.
seed(47)
pop_heights = norm.rvs(172, 5, size=50000)
_ = plt.hist(pop_heights, bins=30)
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of people')
_ = plt.title('Distribution of heights in entire town population')
_ = plt.axvline(172, color='r')
_ = plt.axvline(172+5, color='r', linestyle='--')
_ = plt.axvline(172-5, color='r', linestyle='--')
_ = plt.axvline(172+10, color='r', linestyle='-.')
_ = plt.axvline(172-10, color='r', linestyle='-.')
Now, 50,000 people is rather a lot to chase after with a tape measure. If all you want to know is the average height of the townsfolk, then can you just go out and measure a sample to get a pretty good estimate of the average height?
def townsfolk_sampler(n):
    return np.random.choice(pop_heights, n)
Let's say you go out one day and randomly sample 10 people to measure.
seed(47)
daily_sample1 = townsfolk_sampler(10)
_ = plt.hist(daily_sample1, bins=10)
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of people')
_ = plt.title('Distribution of heights in sample size 10')
The sample distribution doesn't resemble what we take the population distribution to be. What do we get for the mean?
np.mean(daily_sample1)
And if we went out and repeated this experiment?
daily_sample2 = townsfolk_sampler(10)
np.mean(daily_sample2)
Q8: Simulate performing this random trial every day for a year, calculating the mean of each daily sample of 10, and plot the resultant sampling distribution of the mean.
A:
seed(47)
# take your samples here
ls = []
for day in range(365):
    daily_sample = townsfolk_sampler(10)
    ls.append(np.mean(daily_sample))
_ = plt.hist(ls, bins='auto')
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of days')
_ = plt.title('Sampling distribution of the mean (n=10)')
The above is the distribution of the means of samples of size 10 taken from our population. The Central Limit Theorem tells us the expected mean of this distribution will be equal to the population mean, and its standard deviation will be $\sigma / \sqrt{n}$, where $n$ is the size of each individual sample (here 10, not the 365 daily draws), so approximately $5/\sqrt{10} \approx 1.58$.
Q9: Verify the above results from the CLT.
A:
# observed mean and std of the sample means vs the population mean (172)
# and the CLT prediction 5/sqrt(10)
np.mean(ls), 172, np.std(ls), 5/np.sqrt(10)
Remember, in this instance, we knew our population parameters, that the average height really is 172 cm and the standard deviation is 5 cm, and we see some of our daily estimates of the population mean were as low as around 168 and some as high as 176.
Q10: Repeat the above year's worth of samples but for a sample size of 50 (perhaps you had a bigger budget for conducting surveys that year)! Would you expect your distribution of sample means to be wider (more variable) or narrower (more consistent)? Compare your resultant summary statistics to those predicted by the CLT.
A:
seed(47)
# calculate daily means from the larger sample size here
ls = []
for day in range(365):
    daily_sample = townsfolk_sampler(50)
    ls.append(np.mean(daily_sample))
_ = plt.hist(ls, bins='auto')
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of days')
_ = plt.title('Sampling distribution of the mean (n=50)')
# narrower, as expected: np.std(ls) should be close to 5/sqrt(50), about 0.71
np.mean(ls), 172, np.std(ls), 5/np.sqrt(50)
What we've seen so far, then, is that we can estimate population parameters from a sample drawn from the population, and that samples have their own distributions. Furthermore, the larger the sample size, the narrower those sampling distributions become.
All of the above is well and good. We've been sampling from a population we know is normally distributed, we've come to understand when to use $n$ and when to use $n-1$ in the denominator to calculate the spread of a distribution, and we've seen the Central Limit Theorem in action for a sampling distribution. All seems very well behaved in Frequentist land. But, well, why should we really care?
Remember, we rarely (if ever) actually know our population parameters but we still have to estimate them somehow. If we want to make inferences to conclusions like "this observation is unusual" or "my population mean has changed" then we need to have some idea of what the underlying distribution is so we can calculate relevant probabilities. In frequentist inference, we use the formulae above to deduce these population parameters. Take a moment in the next part of this assignment to refresh your understanding of how these probabilities work.
Recall some basic properties of the standard normal distribution, such as that about 68% of observations are within plus or minus 1 standard deviation of the mean. Check out the precise definition of a normal distribution on p. 394 of AoS.
Q11: Using this fact, calculate the probability of observing the value 1 or less in a single observation from the standard normal distribution. Hint: you may find it helpful to sketch the standard normal distribution (the familiar bell shape) and mark the number of standard deviations from the mean on the x-axis and shade the regions of the curve that contain certain percentages of the population.
A:
Calculating this probability involves calculating the area under the curve from the value of 1 and below. In mathematical terms, we need to integrate the probability density function. We could just add together the known areas of chunks (from $-\infty$ to 0, and then 0 to $+\sigma$ in the example above), or we could look the value up in tables (literally). Fortunately, scipy has this functionality built in with the cdf() function.
# empirical check: fraction of the simulated town at or below the mean + 1 sd
np.sum(pop_heights <= 172 + 5) / len(pop_heights)
Q12: Use the cdf() function to answer the question above again and verify you get the same answer.
A:
cdf = norm.cdf(1)
cdf
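As a quick sanity check (a minimal sketch), the same cdf() method reproduces the familiar 68-95-99.7 rule mentioned above:
# fraction of the standard normal within 1, 2, and 3 standard deviations
for k in [1, 2, 3]:
    print(k, norm.cdf(k) - norm.cdf(-k))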
Q13: Using our knowledge of the population parameters for our townsfolks' heights, what is the probability of selecting one person at random and their height being 177 cm or less? Calculate this using both of the approaches given above.
A:
# empirical: fraction of the simulated population at or below 177 cm
np.sum(pop_heights <= 177) / len(pop_heights)
# analytical: area under the curve up to z = (177 - 172)/5 = 1,
# i.e. 50% (below the mean) + 34.1% (mean to +1 sd) = 84.1%
# Cumulative Density Function
norm.cdf((177 - 172) / 5)
Q14: Turning this question around — suppose we randomly pick one person and measure their height and find they are 2.00 m tall. How surprised should we be at this result, given what we know about the population distribution? In other words, how likely would it be to obtain a value at least as extreme as this? Express this as a probability.
A:
# empirical: fraction of the simulated population at least 200 cm tall
# (very likely zero -- a z of 5.6 corresponds to a probability of about 1e-8)
np.sum(pop_heights >= 200) / len(pop_heights)
# analytical: upper-tail probability for z = (200 - 172)/5 = 5.6
1 - norm.cdf((200 - 172) / 5)
What we've just done is calculate the p-value of the observation of someone 2.00m tall (review p-values if you need to on p. 399 of AoS). We could calculate this probability by virtue of knowing the population parameters. We were then able to use the known properties of the relevant normal distribution to calculate the probability of observing a value at least as extreme as our test value.
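As an aside, scipy distributions also expose a survival function, sf(), which computes this upper-tail probability directly and is more numerically accurate far out in the tail than subtracting the cdf from 1; a minimal sketch:
# survival function: P(X >= x), equivalent to 1 - cdf
norm.sf((200 - 172) / 5)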
We're about to come to a pinch, though. We've said a couple of times that we rarely, if ever, know the true population parameters; we have to estimate them from our sample and we cannot even begin to estimate the standard deviation from a single observation.
This is very true, and usually we have sample sizes larger than one. This means we can calculate the mean of the sample as our best estimate of the population mean, and the sample standard deviation (using the $n-1$ denominator) as our best estimate of the population standard deviation.
In other words, we are now coming to deal with the sampling distributions we mentioned above as we are generally concerned with the properties of the sample means we obtain.
Above, we highlighted one result from the CLT, whereby the sampling distribution (of the mean) becomes narrower and narrower with the square root of the sample size. We remind ourselves that another result from the CLT is that even if the underlying population distribution is not normal, the sampling distribution will tend to become normal with sufficiently large sample size. (Check out p. 199 of AoS if you need to revise this). This is the key driver for us 'requiring' a certain sample size, for example you may frequently see a minimum sample size of 30 stated in many places. In reality this is simply a rule of thumb; if the underlying distribution is approximately normal then your sampling distribution will already be pretty normal, but if the underlying distribution is heavily skewed then you'd want to increase your sample size.
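If you'd like to see this for yourself, here is a minimal sketch (the exponential distribution and these particular sample sizes are arbitrary choices) that draws sample means from a heavily skewed distribution and shows the sampling distribution becoming more symmetric as $n$ grows:
from scipy.stats import expon

seed(47)
# sample means from a heavily skewed (exponential) distribution
means_n5 = expon.rvs(size=(1000, 5)).mean(axis=1)
means_n50 = expon.rvs(size=(1000, 50)).mean(axis=1)
_ = plt.hist(means_n5, bins=30, alpha=0.5, label='n=5')
_ = plt.hist(means_n50, bins=30, alpha=0.5, label='n=50')
_ = plt.xlabel('sample mean')
_ = plt.legend()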
Q15: Let's now start from the position of knowing nothing about the heights of people in our town. Using the sampling function and the random seed below:
- sample the heights of 50 townsfolk
- estimate the population mean and standard deviation from that sample (remember which denominator to use!)
- calculate the (95%) margin of error using the critical z value (to 2 decimal places; look this up or use norm.ppf())
- calculate the 95% confidence interval of the mean
- does this interval include the true population mean?
A:
seed(47)
# take your sample now
random_folk = townsfolk_sampler(50)
# true population parameters, printed for reference only
population_m = np.mean(pop_heights)
population_std = np.std(pop_heights, ddof=0)
print(f'(population mean, population std) = {population_m, population_std}')
# critical z value for 95% confidence: the 97.5th percentile (two-sided)
z = np.round(norm.ppf(0.975), 2)
# estimates from the sample -- all we would know in practice
m = np.mean(random_folk)
sampleStd = np.std(random_folk, ddof=1)
sampleSize = len(random_folk)
# the (95%) margin of error (MOE), using the sample estimate of the spread
MOEz = z * sampleStd / np.sqrt(sampleSize)
print(f'MOE is {MOEz}')
_ = plt.hist(random_folk, bins='auto')
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of people')
_ = plt.title('Distribution of heights in sample of size 50')
_ = plt.axvline(m, color='b')
_ = plt.axvline(m+MOEz, color='b', linestyle='--')
_ = plt.axvline(m-MOEz, color='b', linestyle='--')
_ = plt.axvline(172, color='r')
print('H0 : sample mean == population mean,\
\nH1: sample mean != population mean,\
\nalpha=0.05\n')
print('z-test (possible here only because we happen to know the true population parameters)')
print(f'z value = {(m - population_m) / (population_std / np.sqrt(sampleSize))}')
print(f'z0 = {z}')
print(f'\nMOE with z critical value is {round(MOEz, 2)}cm')
print(f'\nConfidence interval for the mean: (lower, upper) = \n{m - MOEz, m + MOEz}')
print(f'\nIs the population mean, 172cm, within the CI? => {m - MOEz < 172 and 172 < m + MOEz}')
Q16: Above, we calculated the confidence interval using the critical z value. What is the problem with this? What requirement, or requirements, are we (strictly) failing?
A:
Using the critical z value strictly requires that we know the population standard deviation (equivalently, the true population distribution). In practice we don't know it: all we have is our single sample of 50, from which we estimated the spread. Plugging a sample estimate of the standard deviation into a z-based interval understates our uncertainty, and that extra uncertainty is exactly what the t distribution, used in the next question, accounts for.
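One way to see how the t distribution compensates for this (a minimal sketch) is to compare its two-sided 95% critical value with the normal's 1.96 for a few sample sizes; the t value is always larger, and converges towards 1.96 as the sample grows:
# 97.5th-percentile critical values: t (df = n-1) vs z, for a 95% two-sided interval
for n in [5, 10, 50, 500]:
    print(n, t.ppf(0.975, n - 1), norm.ppf(0.975))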
Q17: Calculate the 95% confidence interval for the mean using the t distribution. Is this wider or narrower than that based on the normal distribution above? If you're unsure, you may find this resource useful. For calculating the critical value, remember how you could calculate this for the normal distribution using norm.ppf().
A:
# Calculate the 95% Confidence Interval of the mean (confidence intervals are defined on p. 385 of AoS)
# (== CI)
# 100% - 95% = 5%
alpha = 0.05
# degree of freedom
dof = sampleSize - 1
# probability point function for t
t_ = t.ppf(1-alpha/2,dof)
# ref:
# https://www.statisticshowto.com/probability-and-statistics/confidence-interval/
# sample standard deviation: the unbiased (n-1) estimate of the population spread
sampleStd = np.std(random_folk, ddof=1)
# standard error of the mean
SE = sampleStd / np.sqrt(sampleSize)
# margin of error with the t critical value
MOEt = t_ * SE
print('H0 : sample mean == population mean,\
\nH1: sample mean != population mean,\
\nalpha=0.05\n')
print(f'Sample standard deviation = {sampleStd}')
print(f'Standard Error = {SE}')
print('\nt-test')
print(f't value = {(m - population_m) / SE}')
print(f't0 = {t_}')
print(f'\nMOE with t critical value is {round(MOEt, 2)}cm')
print(f'\nConfidence interval for the mean: (lower, upper) = \n{m - MOEt, m + MOEt}')
print(f'\nIs the population mean, 172cm, within the CI? => {m - MOEt < 172 and 172 < m + MOEt}')
# sample mean
m = np.mean(random_folk)
_ = plt.hist(random_folk, bins='auto')
_ = plt.xlabel('height (cm)')
_ = plt.ylabel('number of people')
_ = plt.title('Distribution of heights in sample of size 50')
_ = plt.axvline(m, color='b')
_ = plt.axvline(m+MOEt, color='b', linestyle='--')
_ = plt.axvline(m-MOEt, color='b', linestyle='--')
_ = plt.axvline(172, color='r')
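As a cross-check (a minimal sketch using scipy's built-in interval method, passing the confidence level, degrees of freedom, centre, and scale), the same interval can be produced in a single call:
# same 95% CI in one call: confidence level, df, centre (loc), scale
t.interval(0.95, dof, loc=m, scale=SE)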
For the same 95% confidence level, the t distribution gives a wider interval than the normal (z) distribution: with 49 degrees of freedom its critical value is about 2.01 against the z value of 1.96, reflecting the extra uncertainty that comes from estimating the population standard deviation from the sample.
This is slightly wider than the previous confidence interval. This reflects the greater uncertainty given that we are estimating population parameters from a sample.
Having completed this project notebook, you now have hands-on experience:
- sampling and calculating probabilities from a normal distribution
- identifying the correct way to estimate the standard deviation of a population (the population parameter) from a sample
- working with sampling distributions and seeing how the Central Limit Theorem applies to them
- calculating critical values and confidence intervals